roc_auc_score (ROC AUC)#
roc_auc_score computes the area under the ROC curve. It evaluates how well a model ranks positive examples above negative examples, using scores (probabilities or decision values), not hard class labels.
You will learn
How thresholds produce points on the ROC curve (TPR vs FPR)
Two equivalent AUC formulas: trapezoid area and Mann–Whitney (rank) view
A NumPy implementation of `roc_curve` + `roc_auc_score` (tie-safe)
How to optimize for AUC with a differentiable pairwise surrogate (NumPy)
Quick import#
from sklearn.metrics import roc_auc_score
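A minimal sketch of the call (toy labels and made-up probabilities, just for illustration): the second argument is a score, not a hard class label.
y_true = [0, 1, 1, 0, 1]
y_prob = [0.1, 0.8, 0.7, 0.4, 0.9]  # predicted P(y = 1); higher = more positive
roc_auc_score(y_true, y_prob)  # 1.0 here: every positive outranks every negative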
Prerequisites#
Binary classification labels (0/1)
Confusion matrix terms: TP / FP / TN / FN
Basic probability and calculus
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from plotly.subplots import make_subplots
from sklearn.metrics import average_precision_score
from sklearn.metrics import roc_auc_score as skl_roc_auc_score
from sklearn.metrics import roc_curve as skl_roc_curve
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)
SEED = 42
rng = np.random.default_rng(SEED)
1) From scores to TPR/FPR (one threshold)#
Assume:
true labels: \(y_i \in \{0,1\}\)
model scores (higher = more positive): \(s_i \in \mathbb{R}\)
threshold: \(\tau\)
We predict positive if:
\[
\hat{y}_i(\tau) = \mathbb{1}[s_i \ge \tau].
\]
From the confusion matrix at threshold \(\tau\):
\[
\mathrm{TPR}(\tau) = \frac{TP}{TP + FN}, \qquad \mathrm{FPR}(\tau) = \frac{FP}{FP + TN}.
\]
TPR = recall (a.k.a. sensitivity)
FPR = 1 − specificity
def confusion_at_threshold(y_true, y_score, threshold, pos_label=1):
y_true = np.asarray(y_true)
y_score = np.asarray(y_score)
if y_true.shape[0] != y_score.shape[0]:
raise ValueError("y_true and y_score must have the same length.")
pos = y_true == pos_label
pred_pos = y_score >= threshold
tp = np.sum(pos & pred_pos)
fp = np.sum(~pos & pred_pos)
fn = np.sum(pos & ~pred_pos)
tn = np.sum(~pos & ~pred_pos)
return tp, fp, tn, fn
def tpr_fpr_from_confusion(tp, fp, tn, fn):
tpr = tp / (tp + fn) if (tp + fn) > 0 else np.nan
fpr = fp / (fp + tn) if (fp + tn) > 0 else np.nan
return tpr, fpr
y_true_small = np.array([1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0])
y_score_small = np.array([0.95, 0.90, 0.80, 0.75, 0.60, 0.55, 0.52, 0.50, 0.40, 0.35, 0.30, 0.10])
threshold = 0.50
tp, fp, tn, fn = confusion_at_threshold(y_true_small, y_score_small, threshold=threshold)
tpr, fpr = tpr_fpr_from_confusion(tp, fp, tn, fn)
df_small = pd.DataFrame({"y_true": y_true_small, "score": y_score_small})
df_small = df_small.sort_values("score", ascending=False).reset_index(drop=True)
df_small[f"y_pred(score \u2265 {threshold:.2f})"] = (df_small["score"] >= threshold).astype(int)
df_small, {"TP": tp, "FP": fp, "TN": tn, "FN": fn, "TPR": tpr, "FPR": fpr}
( y_true score y_pred(score ≥ 0.50)
0 1 0.95 1
1 0 0.90 1
2 1 0.80 1
3 0 0.75 1
4 1 0.60 1
5 0 0.55 1
6 0 0.52 1
7 1 0.50 1
8 0 0.40 0
9 1 0.35 0
10 0 0.30 0
11 0 0.10 0,
{'TP': 4, 'FP': 4, 'TN': 3, 'FN': 1, 'TPR': 0.8, 'FPR': 0.5714285714285714})
2) ROC curve: sweep the threshold#
The ROC curve plots \((\mathrm{FPR}(\tau), \mathrm{TPR}(\tau))\) as we move the threshold \(\tau\) from very strict to very lenient:
\(\tau = +\infty\) ⇒ predict nothing positive ⇒ (FPR,TPR) = (0,0)
\(\tau\) decreases ⇒ more predicted positives ⇒ move up/right
\(\tau = -\infty\) ⇒ predict everything positive ⇒ (1,1)
A random ranking gives the diagonal line \(\mathrm{TPR} = \mathrm{FPR}\).
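As a quick sanity check (a sketch with synthetic data; the sample size and seed are arbitrary), scores drawn independently of the labels should give AUC ≈ 0.5:
rng_check = np.random.default_rng(0)  # local generator, leaves the notebook's rng untouched
y_rand = rng_check.integers(0, 2, size=10_000)
s_rand = rng_check.normal(size=10_000)  # independent of y_rand → random ranking
skl_roc_auc_score(y_rand, s_rand)  # ≈ 0.5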
def roc_curve_bruteforce(y_true, y_score, pos_label=1):
y_true = np.asarray(y_true)
y_score = np.asarray(y_score)
thresholds = np.r_[np.inf, np.sort(np.unique(y_score))[::-1]]
fpr = []
tpr = []
for thr in thresholds:
tp, fp, tn, fn = confusion_at_threshold(y_true, y_score, thr, pos_label=pos_label)
tpr_i, fpr_i = tpr_fpr_from_confusion(tp, fp, tn, fn)
fpr.append(fpr_i)
tpr.append(tpr_i)
return np.asarray(fpr), np.asarray(tpr), thresholds
fpr_b, tpr_b, thr_b = roc_curve_bruteforce(y_true_small, y_score_small)
auc_b = np.trapezoid(tpr_b, fpr_b)  # use np.trapz on NumPy < 2.0
df_roc_small = pd.DataFrame({"threshold": thr_b, "fpr": fpr_b, "tpr": tpr_b})
df_roc_small
| | threshold | fpr | tpr |
|---|---|---|---|
| 0 | inf | 0.000000 | 0.0 |
| 1 | 0.95 | 0.000000 | 0.2 |
| 2 | 0.90 | 0.142857 | 0.2 |
| 3 | 0.80 | 0.142857 | 0.4 |
| 4 | 0.75 | 0.285714 | 0.4 |
| 5 | 0.60 | 0.285714 | 0.6 |
| 6 | 0.55 | 0.428571 | 0.6 |
| 7 | 0.52 | 0.571429 | 0.6 |
| 8 | 0.50 | 0.571429 | 0.8 |
| 9 | 0.40 | 0.714286 | 0.8 |
| 10 | 0.35 | 0.714286 | 1.0 |
| 11 | 0.30 | 0.857143 | 1.0 |
| 12 | 0.10 | 1.000000 | 1.0 |
point_labels = ["inf" if np.isinf(t) else f"{t:.2f}" for t in thr_b]
fig = make_subplots(
rows=1,
cols=2,
subplot_titles=("ROC curve (toy example)", "TPR/FPR vs threshold"),
)
fig.add_trace(
go.Scatter(
x=fpr_b,
y=tpr_b,
mode="lines+markers",
name=f"ROC (AUC={auc_b:.3f})",
),
row=1,
col=1,
)
fig.add_trace(
go.Scatter(
x=[0, 1],
y=[0, 1],
mode="lines",
line=dict(dash="dash", color="black"),
name="random",
),
row=1,
col=1,
)
fig.add_trace(
go.Scatter(
x=fpr_b,
y=tpr_b,
mode="markers+text",
text=point_labels,
textposition="top center",
marker=dict(size=8),
name="thresholds",
),
row=1,
col=1,
)
mask = np.isfinite(thr_b)
fig.add_trace(
go.Scatter(
x=thr_b[mask],
y=tpr_b[mask],
mode="lines+markers",
name="TPR",
),
row=1,
col=2,
)
fig.add_trace(
go.Scatter(
x=thr_b[mask],
y=fpr_b[mask],
mode="lines+markers",
name="FPR",
),
row=1,
col=2,
)
fig.update_xaxes(title_text="FPR", range=[0, 1], row=1, col=1)
fig.update_yaxes(title_text="TPR", range=[0, 1], row=1, col=1)
fig.update_xaxes(title_text="threshold τ", autorange="reversed", row=1, col=2)
fig.update_yaxes(title_text="rate", range=[0, 1], row=1, col=2)
fig.update_layout(width=950, height=430)
fig.show()
3) AUC: “area” and “probability of correct ranking”#
The ROC AUC is the area under the ROC curve:
\[
\mathrm{AUC} = \int_0^1 \mathrm{TPR}(\mathrm{FPR}) \, d\mathrm{FPR},
\]
where we integrate TPR as a function of FPR.
A powerful equivalent view (binary case) is:
\[
\mathrm{AUC} = \mathbb{P}(s^+ > s^-) + \tfrac{1}{2}\,\mathbb{P}(s^+ = s^-),
\]
where \(s^+\) is the score of a random positive example and \(s^-\) is the score of a random negative example.
So AUC is a ranking metric:
any strictly monotonic transform of the score (e.g. logits → probabilities) leaves AUC unchanged
AUC = 0.5 means random ranking, AUC = 1.0 means perfect ranking
pos_scores = y_score_small[y_true_small == 1]
neg_scores = y_score_small[y_true_small == 0]
auc_pairwise = (pos_scores[:, None] > neg_scores[None, :]).mean() + 0.5 * (
pos_scores[:, None] == neg_scores[None, :]
).mean()
auc_pairwise, auc_b
(0.6571428571428571, 0.6571428571428571)
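The same probability can also be estimated by Monte Carlo: sample random positive/negative pairs and count correct orderings, with ties at half weight (a sketch; the pair count and local seed are arbitrary).
rng_mc = np.random.default_rng(0)  # local generator, leaves the notebook's rng untouched
mc_pos = rng_mc.choice(pos_scores, size=100_000, replace=True)
mc_neg = rng_mc.choice(neg_scores, size=100_000, replace=True)
(mc_pos > mc_neg).mean() + 0.5 * (mc_pos == mc_neg).mean()  # ≈ auc_pairwise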
n_pairs = 500
pos_s = rng.choice(pos_scores, size=n_pairs, replace=True)
neg_s = rng.choice(neg_scores, size=n_pairs, replace=True)
df_pairs = pd.DataFrame({"neg_score": neg_s, "pos_score": pos_s})
min_s = float(min(df_pairs["neg_score"].min(), df_pairs["pos_score"].min()))
max_s = float(max(df_pairs["neg_score"].max(), df_pairs["pos_score"].max()))
fig = px.scatter(
df_pairs,
x="neg_score",
y="pos_score",
opacity=0.55,
title=(
"Random positive/negative score pairs (above diagonal = correct ranking)" f"<br>AUC ≈ {auc_pairwise:.3f}"
),
)
fig.add_shape(
type="line",
x0=min_s,
y0=min_s,
x1=max_s,
y1=max_s,
line=dict(color="black", dash="dash"),
)
fig.update_xaxes(title="negative score s⁻")
fig.update_yaxes(title="positive score s⁺")
fig.update_layout(width=650, height=520)
fig.show()
4) NumPy implementation (ROC curve + ROC AUC)#
A direct implementation by scanning all thresholds can be \(O(n^2)\).
A faster approach:
Sort examples by score (descending)
Sweep the threshold from high to low
Track cumulative TP and FP counts
Record a ROC point only when the score changes (tie handling)
This is \(O(n \log n)\) due to sorting.
def roc_curve_np(y_true, y_score, pos_label=1):
"""Compute ROC curve points (FPR, TPR) for binary classification.
Parameters
----------
y_true : array-like of shape (n_samples,)
Binary labels. Anything equal to `pos_label` is treated as positive.
y_score : array-like of shape (n_samples,)
Scores where larger means "more positive".
pos_label : label (default=1)
Which label is considered positive.
"""
y_true = np.asarray(y_true)
y_score = np.asarray(y_score)
if y_true.shape[0] != y_score.shape[0]:
raise ValueError("y_true and y_score must have the same length.")
pos = y_true == pos_label
n_pos = int(pos.sum())
n_neg = int((~pos).sum())
if n_pos == 0 or n_neg == 0:
raise ValueError("roc_curve is undefined with only one class present in y_true.")
order = np.argsort(-y_score, kind="mergesort")
y_score_sorted = y_score[order]
y_pos_sorted = pos[order].astype(int)
distinct_value_indices = np.where(np.diff(y_score_sorted))[0]
threshold_idxs = np.r_[distinct_value_indices, y_pos_sorted.size - 1]
tps = np.cumsum(y_pos_sorted)[threshold_idxs]
fps = 1 + threshold_idxs - tps
# Prepend the point at threshold +inf: (FPR,TPR) = (0,0)
tps = np.r_[0, tps]
fps = np.r_[0, fps]
thresholds = np.r_[np.inf, y_score_sorted[threshold_idxs]]
fpr = fps / n_neg
tpr = tps / n_pos
return fpr, tpr, thresholds
def roc_auc_score_np(y_true, y_score, pos_label=1):
fpr, tpr, _ = roc_curve_np(y_true, y_score, pos_label=pos_label)
    return float(np.trapezoid(tpr, fpr))  # use np.trapz on NumPy < 2.0
def rankdata_average_ties(x):
"""Ranks starting at 1, using average ranks for ties (NumPy-only)."""
x = np.asarray(x)
order = np.argsort(x, kind="mergesort")
x_sorted = x[order]
ranks_sorted = np.empty_like(x_sorted, dtype=float)
n = len(x_sorted)
i = 0
rank = 1
while i < n:
j = i + 1
while j < n and x_sorted[j] == x_sorted[i]:
j += 1
# ranks for i..j-1 are rank..rank+(j-i)-1
avg_rank = 0.5 * (rank + (rank + (j - i) - 1))
ranks_sorted[i:j] = avg_rank
rank += j - i
i = j
ranks = np.empty_like(ranks_sorted)
ranks[order] = ranks_sorted
return ranks
def roc_auc_score_mann_whitney_np(y_true, y_score, pos_label=1):
"""AUC via Mann–Whitney U / Wilcoxon rank-sum (tie-safe)."""
y_true = np.asarray(y_true)
y_score = np.asarray(y_score)
if y_true.shape[0] != y_score.shape[0]:
raise ValueError("y_true and y_score must have the same length.")
pos = y_true == pos_label
n_pos = int(pos.sum())
n_neg = int((~pos).sum())
if n_pos == 0 or n_neg == 0:
raise ValueError("roc_auc_score is undefined with only one class present in y_true.")
ranks = rankdata_average_ties(y_score)
sum_ranks_pos = ranks[pos].sum()
u = sum_ranks_pos - n_pos * (n_pos + 1) / 2
return float(u / (n_pos * n_neg))
y_true = rng.integers(0, 2, size=300)
y_score = rng.normal(size=300)
auc_np = roc_auc_score_np(y_true, y_score)
auc_mw = roc_auc_score_mann_whitney_np(y_true, y_score)
auc_skl = skl_roc_auc_score(y_true, y_score)
auc_np, auc_mw, auc_skl
(0.4551141695339381, 0.4551141695339381, 0.4551141695339381)
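If SciPy is available (a reasonably recent version, which returns the U statistic of the first sample), `scipy.stats.mannwhitneyu` gives an independent cross-check: U divided by \(n_+ n_-\) is the AUC.
from scipy.stats import mannwhitneyu

u_stat, _ = mannwhitneyu(y_score[y_true == 1], y_score[y_true == 0])
u_stat / ((y_true == 1).sum() * (y_true == 0).sum())  # matches the AUCs above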
# Our curve matches sklearn when drop_intermediate=False (sklearn defaults to drop_intermediate=True)
fpr_np, tpr_np, thr_np = roc_curve_np(y_true, y_score)
fpr_skl, tpr_skl, thr_skl = skl_roc_curve(y_true, y_score, drop_intermediate=False)
(
np.allclose(fpr_np, fpr_skl),
np.allclose(tpr_np, tpr_skl),
np.allclose(thr_np, thr_skl),
len(fpr_np),
len(skl_roc_curve(y_true, y_score)[0]),
)
(True, True, True, 301, 171)
# AUC is invariant to strictly monotonic transforms of the scores
auc_logits = roc_auc_score_np(y_true, y_score)
auc_prob = roc_auc_score_np(y_true, 1 / (1 + np.exp(-y_score)))
auc_logits, auc_prob
(0.4551141695339381, 0.4551141695339381)
5) Visual intuition: distributions → thresholds → ROC points#
Below we draw score distributions for each class and place a few thresholds. Each threshold maps to a point on the ROC curve.
n_pos, n_neg = 250, 750
scores_pos = rng.normal(loc=1.2, scale=1.0, size=n_pos)
scores_neg = rng.normal(loc=0.0, scale=1.0, size=n_neg)
y_true_big = np.r_[np.ones(n_pos, dtype=int), np.zeros(n_neg, dtype=int)]
y_score_big = np.r_[scores_pos, scores_neg]
perm = rng.permutation(len(y_true_big))
y_true_big = y_true_big[perm]
y_score_big = y_score_big[perm]
fpr, tpr, thresholds = roc_curve_np(y_true_big, y_score_big)
auc_val = roc_auc_score_np(y_true_big, y_score_big)
thresholds_demo = np.quantile(y_score_big, [0.9, 0.5, 0.1])
colors = ["#1f77b4", "#ff7f0e", "#2ca02c"]
fig = make_subplots(
rows=1,
cols=2,
subplot_titles=("Score distributions", f"ROC curve (AUC={auc_val:.3f})"),
)
fig.add_trace(
go.Histogram(
x=y_score_big[y_true_big == 0],
name="negative",
opacity=0.6,
nbinsx=40,
marker_color="gray",
),
row=1,
col=1,
)
fig.add_trace(
go.Histogram(
x=y_score_big[y_true_big == 1],
name="positive",
opacity=0.6,
nbinsx=40,
marker_color="crimson",
),
row=1,
col=1,
)
for thr, c in zip(thresholds_demo, colors):
fig.add_vline(x=float(thr), line_dash="dash", line_color=c, row=1, col=1)
fig.add_trace(go.Scatter(x=fpr, y=tpr, mode="lines", name="ROC"), row=1, col=2)
fig.add_trace(
go.Scatter(
x=[0, 1],
y=[0, 1],
mode="lines",
line=dict(dash="dash", color="black"),
name="random",
),
row=1,
col=2,
)
for thr, c in zip(thresholds_demo, colors):
tp, fp, tn, fn = confusion_at_threshold(y_true_big, y_score_big, threshold=float(thr))
tpr_thr, fpr_thr = tpr_fpr_from_confusion(tp, fp, tn, fn)
fig.add_trace(
go.Scatter(
x=[fpr_thr],
y=[tpr_thr],
mode="markers",
marker=dict(size=10, color=c),
name=f"τ={thr:.2f}",
),
row=1,
col=2,
)
fig.update_layout(barmode="overlay", width=950, height=430)
fig.update_xaxes(title_text="score", row=1, col=1)
fig.update_yaxes(title_text="count", row=1, col=1)
fig.update_xaxes(title_text="FPR", range=[0, 1], row=1, col=2)
fig.update_yaxes(title_text="TPR", range=[0, 1], row=1, col=2)
fig.show()
6) Class imbalance: ROC AUC is prevalence-invariant (PR AUC is not)#
ROC uses rates (TPR/FPR), so duplicating every negative example (same scores) leaves the curve and AUC unchanged.
Precision–recall metrics do change with prevalence, so PR AUC is often preferred for extreme imbalance.
# Duplicate negatives 10x (same scores) to change prevalence
y_true_imbal = np.r_[y_true_big[y_true_big == 1], np.repeat(y_true_big[y_true_big == 0], 10)]
y_score_imbal = np.r_[y_score_big[y_true_big == 1], np.repeat(y_score_big[y_true_big == 0], 10)]
auc_orig = roc_auc_score_np(y_true_big, y_score_big)
auc_imbal = roc_auc_score_np(y_true_imbal, y_score_imbal)
ap_orig = average_precision_score(y_true_big, y_score_big)
ap_imbal = average_precision_score(y_true_imbal, y_score_imbal)
auc_orig, auc_imbal, ap_orig, ap_imbal
(0.8279306666666666,
0.8279306666666666,
0.6333778374447112,
0.2010312536095087)
fpr_o, tpr_o, _ = roc_curve_np(y_true_big, y_score_big)
fpr_i, tpr_i, _ = roc_curve_np(y_true_imbal, y_score_imbal)
fig = go.Figure()
fig.add_trace(go.Scatter(x=fpr_o, y=tpr_o, mode="lines", name=f"original (AUC={auc_orig:.3f})"))
fig.add_trace(go.Scatter(x=fpr_i, y=tpr_i, mode="lines", name=f"negatives ×10 (AUC={auc_imbal:.3f})"))
fig.add_trace(
go.Scatter(
x=[0, 1],
y=[0, 1],
mode="lines",
line=dict(dash="dash", color="black"),
showlegend=False,
)
)
fig.update_layout(
title="ROC curves overlap under prevalence shift",
xaxis_title="FPR",
yaxis_title="TPR",
xaxis=dict(range=[0, 1]),
yaxis=dict(range=[0, 1]),
width=720,
height=450,
)
fig.show()
7) Practical usage (scikit-learn)#
Key points:
Pass scores, not hard labels.
`predict_proba(X)[:, 1]` (probabilities)
`decision_function(X)` (raw scores / logits)
Any monotonic transform of scores gives the same AUC.
For multiclass you must choose `multi_class="ovr"` or `"ovo"` and an averaging strategy.
Docs: sklearn.metrics.roc_auc_score.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
X, y = make_classification(
n_samples=2000,
n_features=10,
n_informative=5,
n_redundant=2,
weights=[0.85, 0.15],
class_sep=1.0,
random_state=SEED,
)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, stratify=y, random_state=SEED
)
clf = LogisticRegression(max_iter=2000)
clf.fit(X_train, y_train)
score_logit = clf.decision_function(X_test)
score_proba = clf.predict_proba(X_test)[:, 1]
score_label = clf.predict(X_test)
# (logits and probabilities have identical ranking → identical AUC;
# hard labels collapse the ROC curve to a single point, hence the lower value)
skl_roc_auc_score(y_test, score_logit), skl_roc_auc_score(y_test, score_proba), skl_roc_auc_score(y_test, score_label)
(0.7989558370421088, 0.7989558370421088, 0.5415525504964054)
8) Optimizing for ROC AUC (NumPy)#
For binary labels, AUC can be written as an average over all positive–negative pairs:
\[
\mathrm{AUC} = \frac{1}{|P|\,|N|} \sum_{i \in P} \sum_{j \in N} \mathbb{1}[s_i > s_j]
\]
(ties, if any, counted with weight \(\tfrac{1}{2}\)).
This depends on pairwise orderings (rankings), which makes it:
non-decomposable over single examples
non-differentiable because of the indicator
A common workaround is to optimize a smooth pairwise surrogate. For a linear scoring model \(s(x)=w^\top x\) one choice is the pairwise logistic loss:
\[
L(w) = \frac{1}{|P|\,|N|} \sum_{i \in P} \sum_{j \in N} \log\!\bigl(1 + e^{-(s_i - s_j)}\bigr).
\]
Minimizing \(L\) encourages \(s_i > s_j\) for positive \(i\) and negative \(j\), i.e. better AUC.
In practice we sample pairs (SGD) instead of enumerating all \(|P||N|\) pairs.
def sigmoid(z):
z = np.asarray(z)
z = np.clip(z, -40, 40)
return 1 / (1 + np.exp(-z))
def add_bias(X):
X = np.asarray(X)
return np.c_[np.ones(X.shape[0]), X]
def make_gaussian_binary(n_pos=250, n_neg=1250, seed=0):
rng_local = np.random.default_rng(seed)
mean_pos = np.array([1.5, 1.5])
mean_neg = np.array([0.0, 0.0])
cov = np.array([[1.0, 0.3], [0.3, 1.0]])
X_pos = rng_local.multivariate_normal(mean_pos, cov, size=n_pos)
X_neg = rng_local.multivariate_normal(mean_neg, cov, size=n_neg)
X = np.vstack([X_pos, X_neg])
y = np.r_[np.ones(n_pos, dtype=int), np.zeros(n_neg, dtype=int)]
perm = rng_local.permutation(len(y))
return X[perm], y[perm]
def train_logistic_logloss_gd(X, y, lr=0.2, steps=2000, l2=1e-3, log_every=50):
Xb = add_bias(X)
y = y.astype(float)
w = np.zeros(Xb.shape[1])
hist = []
for step in range(steps + 1):
scores = Xb @ w
p = sigmoid(scores)
grad = (Xb.T @ (p - y)) / len(y)
reg_grad = l2 * np.r_[0.0, w[1:]] # don't regularize bias
w -= lr * (grad + reg_grad)
if step % log_every == 0:
logloss = -(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12)).mean()
auc = roc_auc_score_np(y.astype(int), scores)
hist.append({"step": step, "logloss": logloss, "train_auc": auc})
return w, pd.DataFrame(hist)
def train_auc_pairwise_sgd(
X, y, lr=0.2, steps=4000, batch_pairs=512, l2=1e-3, log_every=50, seed=0
):
rng_local = np.random.default_rng(seed)
Xb = add_bias(X)
y = y.astype(int)
pos_idx = np.flatnonzero(y == 1)
neg_idx = np.flatnonzero(y == 0)
if len(pos_idx) == 0 or len(neg_idx) == 0:
raise ValueError("Need both classes to optimize AUC.")
w = np.zeros(Xb.shape[1])
hist = []
for step in range(steps + 1):
i = rng_local.choice(pos_idx, size=batch_pairs, replace=True)
j = rng_local.choice(neg_idx, size=batch_pairs, replace=True)
delta = Xb[i] - Xb[j] # x_i - x_j
d = delta @ w # (w^T x_i) - (w^T x_j)
# loss = log(1 + exp(-d))
# dloss/dd = -sigmoid(-d)
grad = -(sigmoid(-d)[:, None] * delta).mean(axis=0)
reg_grad = l2 * np.r_[0.0, w[1:]]
w -= lr * (grad + reg_grad)
if step % log_every == 0:
scores = Xb @ w
auc = roc_auc_score_np(y, scores)
pair_loss = np.log1p(np.exp(-d)).mean()
hist.append({"step": step, "pair_loss": pair_loss, "train_auc": auc})
return w, pd.DataFrame(hist)
X, y = make_gaussian_binary(seed=SEED)
# manual split (stratified-ish via shuffling; dataset is large enough here)
idx = rng.permutation(len(y))
n_train = int(0.7 * len(y))
train_idx, test_idx = idx[:n_train], idx[n_train:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]
w_ce, hist_ce = train_logistic_logloss_gd(X_train, y_train, lr=0.3, steps=2000, log_every=50)
w_auc, hist_auc = train_auc_pairwise_sgd(
X_train, y_train, lr=0.3, steps=3000, batch_pairs=1024, log_every=50, seed=SEED
)
scores_ce_test = add_bias(X_test) @ w_ce
scores_auc_test = add_bias(X_test) @ w_auc
auc_ce_test = roc_auc_score_np(y_test, scores_ce_test)
auc_auc_test = roc_auc_score_np(y_test, scores_auc_test)
auc_ce_test, auc_auc_test
(0.9139835858585859, 0.9127604166666666)
fig = go.Figure()
fig.add_trace(
go.Scatter(
x=hist_ce["step"],
y=hist_ce["train_auc"],
mode="lines",
name="log-loss GD (train AUC)",
)
)
fig.add_trace(
go.Scatter(
x=hist_auc["step"],
y=hist_auc["train_auc"],
mode="lines",
name="pairwise AUC surrogate (train AUC)",
)
)
fig.update_layout(
title="Training AUC over iterations",
xaxis_title="step",
yaxis_title="ROC AUC",
yaxis=dict(range=[0, 1]),
width=760,
height=430,
)
fig.show()
fpr_ce, tpr_ce, _ = roc_curve_np(y_test, scores_ce_test)
fpr_auc, tpr_auc, _ = roc_curve_np(y_test, scores_auc_test)
fig = go.Figure()
fig.add_trace(
go.Scatter(
x=fpr_ce,
y=tpr_ce,
mode="lines",
name=f"log-loss GD (test AUC={auc_ce_test:.3f})",
)
)
fig.add_trace(
go.Scatter(
x=fpr_auc,
y=tpr_auc,
mode="lines",
name=f"AUC surrogate (test AUC={auc_auc_test:.3f})",
)
)
fig.add_trace(
go.Scatter(
x=[0, 1],
y=[0, 1],
mode="lines",
line=dict(dash="dash", color="black"),
showlegend=False,
)
)
fig.update_layout(
title="Test ROC curves",
xaxis_title="FPR",
yaxis_title="TPR",
xaxis=dict(range=[0, 1]),
yaxis=dict(range=[0, 1]),
width=760,
height=450,
)
fig.show()
Pros / cons / when to use#
Pros
Threshold-free: summarizes performance across all thresholds
Ranking-focused: \(\mathbb{P}(s^+ > s^-)\) interpretation is often intuitive
Invariant to monotonic score transforms (logits vs probabilities)
Less sensitive to class imbalance than accuracy (uses normalized rates)
Cons / pitfalls
Not about calibration: probabilities can be badly calibrated and still yield high AUC
Weights all FPR regions equally; if you only care about small FPR, consider partial AUC (see the snippet after this list)
For extreme imbalance, PR AUC can be more informative than ROC AUC
Undefined if `y_true` contains only one class; multiclass requires design choices (ovr/ovo, averaging)
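On the partial-AUC point above: scikit-learn's `roc_auc_score` accepts a `max_fpr` argument, which returns a standardized partial AUC over \(\mathrm{FPR} \in [0, \text{max\_fpr}]\) (McClish correction, so 0.5 still corresponds to random ranking). A sketch reusing the section-5 data:
skl_roc_auc_score(y_true_big, y_score_big, max_fpr=0.1)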
Good for
Model comparison when you care about ranking / screening
Imbalanced classification when you want a threshold-independent ranking metric
Less good for
Picking a single operating threshold under asymmetric costs
Measuring probability quality (use log-loss, Brier score, calibration curves; a small demonstration follows below)
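To see the calibration point concretely, a tiny sketch with made-up probabilities: shrinking every predicted probability by 10× preserves the ranking, so AUC is unchanged, while the Brier score degrades badly.
from sklearn.metrics import brier_score_loss

y_demo = np.array([0, 0, 1, 1])
p_good = np.array([0.1, 0.2, 0.8, 0.9])
p_shrunk = p_good / 10  # same ranking → same AUC, but probabilities far too small
(skl_roc_auc_score(y_demo, p_good), skl_roc_auc_score(y_demo, p_shrunk),
 brier_score_loss(y_demo, p_good), brier_score_loss(y_demo, p_shrunk))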
Exercises#
Implement partial AUC for a max FPR (e.g. integrate only over \(\mathrm{FPR}\in[0,0.1]\)).
Extend `roc_curve_np` to support sample weights.
Show numerically that AUC is unchanged by any strictly monotonic transform (try `np.tanh`, `np.exp`, `sigmoid`).
Multiclass: compute one-vs-rest AUC for each class and compare macro vs weighted averages.
References#
scikit-learn `roc_auc_score`: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html
scikit-learn ROC user guide: https://scikit-learn.org/stable/modules/model_evaluation.html#receiver-operating-characteristic-roc
T. Fawcett (2006), An introduction to ROC analysis
Hanley & McNeil (1982), The meaning and use of the area under a ROC curve